Author

Martin Schweinberger

Published

January 1, 2026

Introduction

This tutorial introduces computational lexicography with R and shows how to use R to create dictionaries, find synonyms, and generate bilingual translation lexicons through statistical analysis of corpus data. While the initial examples focus on English, subsequent sections demonstrate how the approach generalises to other languages — including German — using the udpipe package, which supports more than 60 languages.

Traditionally, dictionaries are listings of words arranged alphabetically, providing information on definitions, usage, etymologies, pronunciations, translations, and related forms (Agnes, Goldman, and Soltis 2002; Steiner 1985). Computational lexicology is the branch of computational linguistics concerned with the computer-based study of lexicons and machine-readable dictionaries (Amsler 1981). Computational lexicography, the focus of this tutorial, is the use of computers in the construction of dictionaries. Although the two terms are sometimes used interchangeably, the distinction between studying a lexicon and building one is conceptually important.

The tutorial is structured around three increasingly complex tasks: (1) generating a basic annotated dictionary from corpus text using part-of-speech tagging; (2) identifying synonym candidates using distributional semantics and cosine similarity; and (3) building a bilingual translation lexicon from parallel text using co-occurrence statistics.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Generate a basic annotated dictionary from corpus text using part-of-speech tagging with udpipe
  2. Correct, extend, and enrich dictionary entries with additional layers of information (sentiment, comments)
  3. Build a term-document matrix from corpus co-occurrence data
  4. Compute Positive Pointwise Mutual Information (PPMI) and cosine similarity between items
  5. Use hierarchical clustering to visualise semantic similarity among words
  6. Extract synonym candidates automatically from a cosine similarity matrix
  7. Create a bilingual translation lexicon from parallel text using contingency-based association measures
  8. Apply the same workflow to languages other than English using multilingual udpipe models

Preparation and Session Set-up

Install required packages once:

Code
install.packages("dplyr")
install.packages("stringr")
install.packages("udpipe")
install.packages("tidytext")
install.packages("tidyr")
install.packages("coop")
install.packages("flextable")
install.packages("textdata")
install.packages("plyr")
install.packages("checkdown")
install.packages("here")    # used below to build model file paths

Load packages for this session:

Code
library(checkdown)   # interactive exercises
library(dplyr)       # data manipulation
library(stringr)     # string processing
library(udpipe)      # part-of-speech tagging (60+ languages)
library(tidytext)    # text mining and sentiment lexicons
library(tidyr)       # data reshaping
library(coop)        # cosine similarity
library(flextable)   # formatted tables
library(plyr)        # join operations for parallel data

Creating Dictionaries

Section Overview

What you will learn: How to use part-of-speech tagging to generate a structured dictionary from raw corpus text, and how to extend and enrich dictionary entries with sentiment information.

Key tools: udpipe for multilingual tagging, tidytext for sentiment lexicons, dplyr for table manipulation.

Loading and tagging the corpus text

In a first step, we load a text. We use George Orwell’s Nineteen Eighty-Four as the source text for our English dictionary.

Code
text <- readLines("tutorials/lex/data/orwell.txt") |>
  paste0(collapse = " ")
# show the first 500 characters of the text
substr(text, start = 1, stop = 500)
[1] "1984 George Orwell Part 1, Chapter 1 It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him. The hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It "

Next, we download a udpipe language model for English. The udpipe package supports more than 60 languages, making this approach directly transferable to other research contexts.

Code
# download English language model (run once; afterwards, load the saved model from disk)
m_eng <- udpipe::udpipe_download_model(language = "english-ewt")

Once downloaded, load the model directly from disk:

Code
# load language model from disk
m_eng <- udpipe_load_model(file = here::here("udpipemodels", "english-ewt-ud-2.5-191206.udpipe"))

We now apply the part-of-speech tagger to the full text. udpipe_annotate() returns a data frame with one row per token, including token form, lemma, universal POS tag, and dependency information:

Code
# tokenise, tag, and parse
text_ann <- udpipe::udpipe_annotate(m_eng, x = text) |>
  as.data.frame() |>
  dplyr::select(
    -sentence, -paragraph_id, -sentence_id, -feats,
    -head_token_id, -dep_rel, -deps, -misc
  )
# inspect
head(text_ann, 10)
   doc_id token_id   token   lemma  upos xpos
1    doc1        1    1984    1984 PROPN  NNP
2    doc1        2  George  George PROPN  NNP
3    doc1        3  Orwell  Orwell PROPN  NNP
4    doc1        4    Part    part PROPN  NNP
5    doc1        5       1       1   NUM   CD
6    doc1        6       ,       , PUNCT    ,
7    doc1        7 Chapter chapter PROPN  NNP
8    doc1        8       1       1   NUM   CD
9    doc1        1      It      it  PRON  PRP
10   doc1        2     was      be   AUX  VBD

Generating the basic dictionary

We use the annotated data to generate a first, basic dictionary holding the word form (token), the part-of-speech tag (upos), the lemmatised word type (lemma), and the frequency with which that word form is used as that part-of-speech in the corpus. We begin by arranging entries by frequency, which is useful for spotting the most important vocabulary items quickly.

Code
text_dict_raw <- text_ann |>
  # remove non-word tokens (punctuation, symbols)
  dplyr::filter(!stringr::str_detect(token, "\\W")) |>
  # remove numeric tokens
  dplyr::filter(!stringr::str_detect(token, "[0-9]")) |>
  dplyr::group_by(token, lemma, upos) |>
  dplyr::summarise(frequency = dplyr::n(), .groups = "drop") |>
  dplyr::arrange(-frequency)
# inspect
head(text_dict_raw, 10)
# A tibble: 10 × 4
   token lemma upos  frequency
   <chr> <chr> <chr>     <int>
 1 the   the   DET        5249
 2 of    of    ADP        2908
 3 a     a     DET        2277
 4 and   and   CCONJ      2064
 5 was   be    AUX        1795
 6 in    in    ADP        1446
 7 to    to    PART       1336
 8 it    it    PRON       1295
 9 he    he    PRON       1270
10 had   have  AUX        1018

Dictionary conventions call for alphabetical ordering. We can switch to that with a single arrange() call:

Code
text_dict <- text_dict_raw |>
  dplyr::arrange(token)
# inspect
head(text_dict, 10)
# A tibble: 10 × 4
   token     lemma     upos  frequency
   <chr>     <chr>     <chr>     <int>
 1 A         a         DET         107
 2 A         a         NOUN          1
 3 AND       and       CCONJ         2
 4 Aaronson  Aaronson  PROPN         8
 5 About     about     ADV           4
 6 Above     above     ADP           2
 7 Abruptly  abruptly  ADV           2
 8 Actually  actually  ADV          13
 9 Adam      Adam      PROPN         1
10 Admission admission NOUN          1

Tagging Accuracy and Manual Post-Editing

POS tagging is not perfect — some tokens will receive incorrect tags and some lemmas will be wrong. Even state-of-the-art taggers reach around 95–97% accuracy on standard text, which means visible errors are inevitable at this scale. The resulting dictionary requires manual review before publication. However, the computational workflow dramatically reduces the effort needed to produce a first draft: instead of generating thousands of entries from scratch, the researcher begins with a near-complete list and corrects errors rather than creating every entry.
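One way to prioritise the manual review is to flag word forms whose lemma received more than one POS tag, since genuinely ambiguous or mis-tagged items cluster there. A minimal base-R sketch on a toy annotation table (made-up data, not the Orwell corpus):

```r
# toy annotation table standing in for the udpipe output
ann <- data.frame(
  lemma = c("run", "run", "run", "the"),
  upos  = c("VERB", "NOUN", "VERB", "DET")
)
# count distinct POS tags per lemma; more than one tag flags a review candidate
n_tags <- table(unique(ann)$lemma)
names(n_tags[n_tags > 1])  # "run"
```

Lemmas surfaced this way are not necessarily errors (many words legitimately occur as several parts of speech), but they are a sensible place to start checking.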

Correcting and extending dictionary entries

One of the advantages of keeping dictionaries in R as data frames is that entries are easy to correct and extend programmatically. Below we demonstrate removing a spurious entry, correcting a POS tag, and adding an annotation column with custom notes.

Code
text_dict_ext <- text_dict |>
  # remove spurious entry: 'a' tagged as NOUN
  dplyr::filter(!(lemma == "a" & upos == "NOUN")) |>
  # correct POS tag: 'aback' (as in 'taken aback') should be ADV, not NOUN
  dplyr::mutate(upos = ifelse(lemma == "aback" & upos == "NOUN", "ADV", upos)) |>
  # add custom comments
  dplyr::mutate(comment = dplyr::case_when(
    lemma == "a"        ~ "also 'an' before vowels",
    lemma == "Aaronson" ~ "name of a character in the novel",
    TRUE                ~ ""
  ))
# inspect
head(text_dict_ext, 10)
# A tibble: 10 × 5
   token     lemma     upos  frequency comment                           
   <chr>     <chr>     <chr>     <int> <chr>                             
 1 A         a         DET         107 "also 'an' before vowels"         
 2 AND       and       CCONJ         2 ""                                
 3 Aaronson  Aaronson  PROPN         8 "name of a character in the novel"
 4 About     about     ADV           4 ""                                
 5 Above     above     ADP           2 ""                                
 6 Abruptly  abruptly  ADV           2 ""                                
 7 Actually  actually  ADV          13 ""                                
 8 Adam      Adam      PROPN         1 ""                                
 9 Admission admission NOUN          1 ""                                
10 Africa    Africa    PROPN        10 ""                                

Adding sentiment information

To make the dictionary more informative, we enrich each entry with sentiment information from the tidytext package. We use the Bing Liu lexicon (Liu 2012), which classifies words as positive or negative.

Code
text_dict_snt <- text_dict_ext |>
  dplyr::mutate(word = lemma) |>
  dplyr::left_join(get_sentiments("bing"), by = "word") |>
  dplyr::group_by(token, lemma, upos, comment) |>
  dplyr::summarise(
    sentiment = paste0(unique(sentiment[!is.na(sentiment)]), collapse = ", "),
    .groups = "drop"
  )
# inspect
head(text_dict_snt, 10)
# A tibble: 10 × 5
   token     lemma     upos  comment                            sentiment 
   <chr>     <chr>     <chr> <chr>                              <chr>     
 1 A         a         DET   "also 'an' before vowels"          ""        
 2 AND       and       CCONJ ""                                 ""        
 3 Aaronson  Aaronson  PROPN "name of a character in the novel" ""        
 4 About     about     ADV   ""                                 ""        
 5 Above     above     ADP   ""                                 ""        
 6 Abruptly  abruptly  ADV   ""                                 "negative"
 7 Actually  actually  ADV   ""                                 ""        
 8 Adam      Adam      PROPN ""                                 ""        
 9 Admission admission NOUN  ""                                 ""        
10 Africa    Africa    PROPN ""                                 ""        

The resulting extended dictionary now contains the token, lemma, POS tag, comment, and sentiment label — a richer lexical resource than the basic dictionary we started with, and one generated entirely automatically from corpus data.


Exercises: Creating Dictionaries

Q1. What is the difference between computational lexicology and computational lexicography?





Q2. After POS tagging, you notice that the word ‘run’ is sometimes tagged as VERB and sometimes as NOUN. Which dplyr approach is most appropriate to correct a specific erroneous tag?






Finding Synonyms: Creating a Thesaurus

Section Overview

What you will learn: How to use distributional semantics — co-occurrence statistics, PPMI weighting, and cosine similarity — to identify synonym candidates for a set of degree adverbs.

Key concepts: Term-document matrix, Pointwise Mutual Information (PMI), Positive PMI (PPMI), cosine similarity, hierarchical clustering.

Why distributional methods? The basic assumption of distributional semantics is that words occurring in the same contexts tend to have similar meanings — the distributional hypothesis (Firth 1957). PPMI-weighted cosine similarity has been shown to outperform raw co-occurrence counts for semantic similarity tasks (Bullinaria and Levy 2007; Levshina 2015).

Another key task in lexicography is determining semantic relationships between words — in particular, whether two words are synonymous. In computational linguistics, such relationships are typically determined from collocational profiles, also called word vectors or word embeddings.

In this example, we investigate whether a set of degree adverbs (very, really, so, completely, totally, etc.) are synonymous — that is, whether they can be exchanged without substantially changing the meaning of the sentence. This is directly relevant to lexicography: if two adverbs have similar collocational profiles, a dictionary can link them as synonyms or near-synonyms.

Loading the degree adverb data

The dataset contains three columns: a pint column with the degree adverb, an adjs column with the adjective it modifies, and a remove column we do not need.

Code
degree_adverbs <- base::readRDS("tutorials/lex/data/dad.rda") |>
  dplyr::select(-remove) |>
  dplyr::rename(
    degree_adverb = pint,
    adjective     = adjs
  ) |>
  dplyr::filter(
    degree_adverb != "0",   # remove unmodified adjectives
    degree_adverb != "well" # 'well' behaves differently
  )
# inspect
head(degree_adverbs, 10)
   degree_adverb adjective
1           real       bad
2         really      nice
3           very      good
4         really     early
5         really       bad
6         really       bad
7             so      long
8         really wonderful
9         pretty      good
10        really      easy

Building the term-document matrix

We construct a term-document matrix (TDM) showing how often each degree adverb co-occurred with each adjective. Rows are adjectives; columns are degree adverbs; each cell contains the co-occurrence count.

Code
# create term-document matrix
tdm <- ftable(degree_adverbs$adjective, degree_adverbs$degree_adverb)
# extract dimension names
amplifiers <- as.vector(unlist(attr(tdm, "col.vars")[1]))
adjectives <- as.vector(unlist(attr(tdm, "row.vars")[1]))
# attach names
rownames(tdm) <- adjectives
colnames(tdm) <- amplifiers
# inspect
tdm[1:5, 1:5]
          completely extremely pretty real really
able               0         1      0    0      0
actual             0         0      0    1      0
amazing            0         0      0    0      4
available          0         0      0    0      1
bad                0         0      1    2      3
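The attribute handling above can be sidestepped: base table() builds the same kind of co-occurrence matrix with dimension names already attached. A minimal sketch on toy vectors (made-up data, not the tutorial dataset):

```r
# toy data: each position pairs a degree adverb with the adjective it modifies
adv <- c("very", "really", "very", "so")
adj <- c("good", "bad",    "bad",  "good")
# cross-tabulate: rows are adjectives, columns are adverbs
tdm_toy <- table(adj, adv)
tdm_toy["good", "very"]  # co-occurrence count of 'very good': 1
```

The ftable() route used above produces the same counts; the choice between the two is largely a matter of convenience.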

Computing PPMI and cosine similarity

Raw co-occurrence counts are biased towards frequent words. Pointwise Mutual Information (PMI) corrects for this by comparing observed co-occurrence frequency to what would be expected if the two words were independent. Positive PMI (PPMI) replaces all negative PMI values with zero, which improves performance on semantic similarity tasks (Bullinaria and Levy 2007; Levshina 2015).
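Before applying this to the full matrix, the arithmetic can be checked on a toy table (made-up counts, not the tutorial data): expected counts come from the row and column margins, PMI is the log2 ratio of observed to expected, and PPMI clamps negative values to zero.

```r
# toy co-occurrence counts: 2 adjectives x 2 adverbs
m <- matrix(c(8, 2,
              2, 8),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("good", "bad"), c("very", "so")))
# expected counts under independence: (row sum * column sum) / grand total
expected <- outer(rowSums(m), colSums(m)) / sum(m)
pmi  <- log2(m / expected)   # zero cells would give -Inf here
ppmi <- pmax(pmi, 0)         # clamp negative (and -Inf) values to 0
round(ppmi, 3)
```

Here every expected count is 5, so the over-represented pairs (8 observed vs. 5 expected) get a positive PPMI of log2(8/5), and the under-represented pairs drop to 0.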

We then compute cosine similarity between the PPMI vectors of each degree adverb. Cosine similarity ranges from 0 (no shared context) to 1 (identical context profile).

Code
# compute expected values under independence
tdm.exp <- chisq.test(tdm)$expected
# calculate PMI and PPMI (zero-count cells give -Inf PMI; PPMI sets all negative values, including -Inf, to 0)
PMI  <- log2(tdm / tdm.exp)
PPMI <- ifelse(PMI < 0, 0, PMI)
# calculate cosine similarity between amplifier vectors
cosinesimilarity <- cosine(PPMI)
# inspect
cosinesimilarity[1:5, 1:5]
           completely extremely   pretty    real   really
completely    1.00000  0.204189 0.000000 0.05304 0.126668
extremely     0.20419  1.000000 0.007319 0.00000 0.004235
pretty        0.00000  0.007319 1.000000 0.09441 0.062323
real          0.05304  0.000000 0.094413 1.00000 0.131957
really        0.12667  0.004235 0.062323 0.13196 1.000000

Visualising clusters with a dendrogram

We convert the cosine similarity matrix to a distance matrix and apply Ward’s hierarchical clustering to visualise the similarity structure.

Code
# find maximum similarity value that is not 1 (self-similarity)
cosinesimilarity.test <- apply(cosinesimilarity, 1, function(x) {
  x <- ifelse(x == 1, 0, x)
})
maxval <- max(cosinesimilarity.test)
# convert similarity to distance
amplifier.dist <- 1 - (cosinesimilarity / maxval)
clustd <- as.dist(amplifier.dist)
Code
# hierarchical clustering with Ward's method
cd <- hclust(clustd, method = "ward.D")
# plot
plot(cd, main = "", sub = "", yaxt = "n", ylab = "", xlab = "", cex = .8)

The dendrogram reveals interpretable clusters. Completely, extremely, and totally form a cluster of strong, absolute intensifiers that are interchangeable with each other but not with milder adverbs. Real and really cluster together as colloquial variants. This structure matches what an experienced lexicographer would expect, and the method has recovered it automatically from corpus data.

Extracting synonym candidates

To extract synonyms automatically, we find the most similar adverb for each entry in the cosine similarity matrix: we replace diagonal values (each word’s perfect similarity to itself) with 0, then look up the column with the highest remaining value.

A Note on Syntactic Context

The synonym candidates here are based purely on collocational profile similarity. A complete synonym analysis would also consider syntactic context: very and so have similar profiles, but so is strongly disfavoured in attributive position (a so great tutorial is unusual, whereas a very great tutorial is fine). A full lexicographic treatment would require filtering by syntactic function before computing similarity.

Code
# build synonym table: zero out self-similarity on the diagonal
sim <- cosinesimilarity
diag(sim) <- 0
# for each row, extract the column with the highest remaining value
syntb <- data.frame(
  word      = rownames(sim),
  synonym   = colnames(sim)[apply(sim, 1, which.max)],
  row.names = rownames(sim)
)
syntb
                 word    synonym
completely completely  extremely
extremely   extremely completely
pretty         pretty       real
real             real     really
really         really       real
so                 so       real
totally       totally completely
very             very         so

The results confirm the clustering: completely and extremely select each other, totally is paired with completely, real and really point to each other, and very is paired with so, consistent with both prior expectations and the dendrogram above.

For further reading on semantic vector space modelling, Rajeg, Denistia, and Musgrave (2019) provide an accessible introduction, and Levshina (2015) offers a comprehensive treatment of distributional methods for corpus linguists.


Exercises: Finding Synonyms

Q1. Why is Positive PMI (PPMI) preferred over raw PMI for computing semantic similarity?





Q2. In the dendrogram, completely, extremely, and totally form a tight cluster. What does this tell us lexicographically?






Creating Bilingual Dictionaries

Section Overview

What you will learn: How to generate a bilingual translation lexicon from parallel text using word co-occurrence statistics and contingency-based association measures.

Key concepts: Parallel corpus, sentence alignment, co-occurrence matrix, Fisher’s Exact Test, phi coefficient.

Why this matters: Data-driven translation lexicons can be generated for any language pair for which parallel data exists — including low-resource languages where commercial dictionaries are unavailable.

Translation dictionaries map words in one language to their counterparts in another. If a German word and an English word tend to co-occur across sentence-translation pairs, they are likely translations of each other. The quality of the result depends on the quantity and alignment quality of the parallel data, and grammatical differences between languages introduce additional challenges.

Loading parallel text

We load a sample of German sentences and their English translations. Each line contains a German sentence and its English translation, separated by a spaced em dash (" — ").

Code
# load parallel translation data
translations <- readLines("tutorials/lex/data/translation.txt",
                          encoding = "UTF-8", skipNul = TRUE)

translations

Guten Tag! — Good day!
Guten Morgen! — Good morning!
Guten Abend! — Good evening!
Hallo! — Hello!
Wo kommst du her? — Where are you from?
Woher kommen Sie? — Where are you from?
Ich bin aus Hamburg. — I am from Hamburg.
Ich komme aus Hamburg. — I come from Hamburg.
Ich bin Deutscher. — I am German.
Schön Sie zu treffen. — Pleasure to meet you!
Wie lange lebst du schon in Brisbane? — How long have you been living in Brisbane?
Leben Sie schon lange hier? — Have you been living here for long?
Welcher Bus geht nach Brisbane? — Which bus goes to Brisbane?
Von welchem Gleis aus fährt der Zug? — Which platform is the train leaving from?
Ist dies der Bus nach Toowong? — Is this the bus going to Toowong?

Separating German and English sentences

We split the parallel data into two tables — one for German, one for English — each indexed by sentence number. The sentence index preserves the alignment between source and target sentences.

Code
# separate German and English, remove punctuation
german  <- stringr::str_remove_all(translations, " [-\u2014\u2013] .*") |>
           stringr::str_remove_all("[[:punct:]]")
english <- stringr::str_remove_all(translations, ".* [-\u2014\u2013] ") |>
           stringr::str_remove_all("[[:punct:]]")
sentence <- 1:length(german)
germantb  <- data.frame(sentence, german)
englishtb <- data.frame(sentence, english)

sentence  german
       1  Guten Tag
       2  Guten Morgen
       3  Guten Abend
       4  Hallo
       5  Wo kommst du her
       6  Woher kommen Sie
       7  Ich bin aus Hamburg
       8  Ich komme aus Hamburg
       9  Ich bin Deutscher
      10  Schön Sie zu treffen
      11  Wie lange lebst du schon in Brisbane
      12  Leben Sie schon lange hier
      13  Welcher Bus geht nach Brisbane
      14  Von welchem Gleis aus fährt der Zug
      15  Ist dies der Bus nach Toowong

Creating word-level co-occurrence pairs

We tokenise the sentences into individual words and cross-join German and English tokens within each sentence. Each row of the result represents a German–English word pair that co-occurred in the same sentence translation unit.

Code
# tokenise German sentences
german_tokens <- germantb |>
  tidytext::unnest_tokens(word, german)

# join English sentences by sentence id, then tokenise English
transtb <- german_tokens |>
  dplyr::left_join(englishtb, by = "sentence") |>
  tidytext::unnest_tokens(trans, english) |>
  dplyr::rename(german = word, english = trans) |>
  dplyr::select(german, english) |>
  dplyr::mutate(
    german  = factor(german),
    english = factor(english)
  )

german  english
guten   good
guten   day
tag     good
tag     day
guten   good
guten   morning
morgen  good
morgen  morning
guten   good
guten   evening
abend   good
abend   evening
hallo   hello
wo      where
wo      are

Building the co-occurrence matrix

From the word-pair table we construct a co-occurrence matrix: rows are English words, columns are German words, and each cell is the count of how many times that German–English pair appeared in the same sentence pair.

Code
# construct term-document matrix
tdm <- ftable(transtb$german, transtb$english)
# extract dimension names
german  <- as.vector(unlist(attr(tdm, "col.vars")[1]))
english <- as.vector(unlist(attr(tdm, "row.vars")[1]))
# assign names
rownames(tdm) <- english
colnames(tdm) <- german
# inspect
tdm[1:10, 1:10]
         a accident all am ambulance an and any anything are
ab       0        0   0  0         0  0   0   0        0   0
abend    0        0   0  0         0  0   0   0        0   0
allem    0        0   0  0         0  0   0   0        0   0
alles    0        0   1  0         0  0   0   0        0   0
am       0        0   0  0         0  0   0   0        0   0
an       0        0   0  0         0  0   0   0        0   0
anderen  1        0   0  0         0  0   0   0        0   0
apotheke 1        0   0  1         0  0   0   0        0   0
arzt     1        0   0  0         0  0   0   0        0   0
auch     3        0   0  0         0  0   0   0        1   0

Computing association strength

We use Fisher’s Exact Test and the phi coefficient (φ) to measure the statistical association between each German–English word pair, controlling for marginal frequencies — the same approach used in keyword analysis and collocation research.
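For a single word pair, the computation reduces to a 2x2 contingency table built from four counts: a (the pair co-occurs), b (the German word occurs without the English word), c (the English word occurs without the German word), and d (neither occurs). A self-contained sketch with made-up counts (not the tutorial data):

```r
# toy 2x2 contingency table for one candidate translation pair
a <- 10; b <- 5; c <- 5; d <- 180
tab <- matrix(c(a, b, c, d), ncol = 2, byrow = TRUE)
# Fisher's Exact Test: significance of the association
p <- fisher.test(tab)$p.value
# phi coefficient: effect size derived from the chi-squared statistic
x2  <- unname(chisq.test(tab)$statistic)  # Yates-corrected by default
phi <- sqrt(x2 / sum(tab))
c(p = round(p, 5), phi = round(phi, 3))
```

A pair that co-occurs far more often than its marginal frequencies predict yields a small p-value and a phi well above 0; pairs that co-occur only by chance hover near phi = 0.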

Code
coocdf <- as.data.frame(as.matrix(tdm))
cooctb <- coocdf |>
  dplyr::mutate(German = rownames(coocdf)) |>
  tidyr::gather(
    English, TermCoocFreq,
    colnames(coocdf)[1]:colnames(coocdf)[ncol(coocdf)]
  ) |>
  dplyr::mutate(
    German  = factor(German),
    English = factor(English)
  ) |>
  dplyr::mutate(AllFreq = sum(TermCoocFreq)) |>
  dplyr::group_by(German) |>
  dplyr::mutate(TermFreq = sum(TermCoocFreq)) |>
  dplyr::ungroup() |>
  dplyr::group_by(English) |>
  dplyr::mutate(CoocFreq = sum(TermCoocFreq)) |>
  dplyr::arrange(German) |>
  dplyr::mutate(
    a = TermCoocFreq,
    b = TermFreq - a,
    c = CoocFreq - a,
    d = AllFreq - (a + b + c)
  ) |>
  dplyr::mutate(NRows = nrow(coocdf)) |>
  dplyr::filter(TermCoocFreq > 0)

German  English     TermCoocFreq  AllFreq  TermFreq  CoocFreq  a  b    c      d  NRows
ab      departing              1    3,504         5         5  1  4    4  3,495    215
ab      is                     1    3,504         5       116  1  4  115  3,384    215
ab      the                    1    3,504         5       125  1  4  124  3,375    215
ab      train                  1    3,504         5        16  1  4   15  3,484    215
ab      when                   1    3,504         5        27  1  4   26  3,473    215
abend   evening                1    3,504         2         2  1  1    1  3,501    215
abend   good                   1    3,504         2        16  1  1   15  3,487    215
allem   döner                  1    3,504         5        10  1  4    9  3,490    215
allem   everything             1    3,504         5         5  1  4    4  3,495    215
allem   one                    1    3,504         5        30  1  4   29  3,470    215
allem   please                 1    3,504         5       111  1  4  110  3,389    215
allem   with                   1    3,504         5        22  1  4   21  3,478    215
alles   all                    1    3,504         6         5  1  5    4  3,494    215
alles   for                    1    3,504         6        93  1  5   92  3,406    215
alles   no                     1    3,504         6         7  1  5    6  3,492    215

Extracting the best translation candidates

We compute Fisher’s Exact Test and the phi coefficient for each word pair, retain only pairs where observed co-occurrence exceeds expected (genuine positive associations), and rank by phi.

Code
translationtb <- cooctb |>
  dplyr::rowwise() |>
  dplyr::mutate(
    p  = round(as.vector(unlist(
      fisher.test(matrix(c(a, b, c, d), ncol = 2, byrow = TRUE))[1])), 5),
    x2 = round(as.vector(unlist(
      chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = TRUE))[1])), 3)
  ) |>
  dplyr::mutate(
    phi      = round(sqrt((x2 / (a + b + c + d))), 3),
    expected = as.vector(unlist(
      chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = TRUE))$expected[1]))
  ) |>
  dplyr::filter(TermCoocFreq > expected) |>
  dplyr::arrange(-phi) |>
  dplyr::select(-AllFreq, -a, -b, -c, -d, -NRows, -expected)

German     English     TermCoocFreq  TermFreq  CoocFreq        p     x2    phi
hallo      hello                  1         1         1  0.00029  875.5  0.500
abend      evening                1         2         2  0.00114  218.2  0.250
ja         yes                    1         2         2  0.00114  218.2  0.250
morgen     morning                1         2         2  0.00114  218.2  0.250
tag        day                    1         2         2  0.00114  218.2  0.250
guten      good                   4        13        16  0.00000  201.1  0.240
brauche    need                   5        20        27  0.00000  124.2  0.188
nein       no                     2         9         7  0.00012  122.7  0.187
bier       beer                   2         8         8  0.00013  120.8  0.186
hamburg    hamburg                2         8         8  0.00013  120.8  0.186
braucht    he                     1         3         3  0.00257   96.5  0.166
braucht    medication             1         3         3  0.00257   96.5  0.166
braucht    needs                  1         3         3  0.00257   96.5  0.166
deutscher  german                 1         3         3  0.00257   96.5  0.166
er         he                     1         3         3  0.00257   96.5  0.166

The results show that even a small parallel corpus yields reasonable translation candidates. The top-ranked pairs align well with genuine translation equivalents. Mismatches further down the ranking illustrate the need for more data to disambiguate polysemous words and handle idiomatic expressions. The approach scales directly: with a larger parallel corpus, accuracy improves substantially.


Exercises: Bilingual Dictionaries

Q1. Why is raw co-occurrence count insufficient for identifying translation equivalents, and what statistical measure does this tutorial use instead?






Generating Dictionaries for Other Languages

Section Overview

What you will learn: How to apply the same dictionary-generation pipeline to a language other than English, using German as a demonstration.

Key point: Because udpipe supports more than 60 languages, the workflow transfers directly to any supported language by simply changing the model file.

The procedure for generating dictionaries can easily be applied to languages other than English. The only change required is the udpipe language model. Here we demonstrate using a sample of the Brothers Grimm fairy tales as a German-language corpus.

Loading a German corpus

Code
grimm <- readLines("tutorials/lex/data/GrimmsFairytales.txt",
                   encoding = "latin1") |>
  paste0(collapse = " ")
# show the first 200 characters
substr(grimm, start = 1, stop = 200)
[1] "Der Froschkönig oder der eiserne Heinrich  Ein Märchen der Brüder Grimm Brüder Grimm  In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön; aber die"

Downloading and loading a German model

Code
# download German model (run once)
udpipe::udpipe_download_model(language = "german-gsd")
Code
# load German model from disk
m_ger <- udpipe_load_model(file = here::here(
  "udpipemodels",
  "german-gsd-ud-2.5-191206.udpipe"
))

Generating the German dictionary

The tagging, filtering, and summarising steps are identical to the English pipeline — only the model and input text change:

Code
grimm_ann <- udpipe::udpipe_annotate(m_ger, x = grimm) |>
  as.data.frame() |>
  dplyr::filter(!stringr::str_detect(token, "\\W")) |>
  dplyr::filter(!stringr::str_detect(token, "[0-9]")) |>
  dplyr::group_by(token, lemma, upos) |>
  dplyr::summarise(frequency = dplyr::n(), .groups = "drop") |>
  dplyr::arrange(lemma)
# inspect
head(grimm_ann, 10)
# A tibble: 10 × 4
   token           lemma           upos  frequency
   <chr>           <chr>           <chr>     <int>
 1 A               A               NOUN          1
 2 Abend           Abend           NOUN          3
 3 Abschied        Abschied        NOUN          1
 4 Ade             Ade             NOUN          2
 5 Allergnädigster Allergnädigster ADJ           1
 6 Alte            Alte            NOUN          1
 7 Angst           Angst           NOUN          1
 8 Antwort         Antwort         NOUN          1
 9 Anwesenden      Anwesende       NOUN          1
10 Anzahl          Anzahl          NOUN          1

The result is a German dictionary derived from the Grimm fairy tales, containing the word form (token), lemma, POS tag, and frequency — the same structure as the English dictionary. The same enrichment steps (adding sentiment, comments, translations) can be applied directly.
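As a minimal sketch of such an enrichment step, the snippet below adds an English translation column and a comment column to a small stand-in for the German dictionary. The lemma-translation pairs and the frequency threshold are illustrative assumptions, not part of the tutorial data; with the real output you would join against `grimm_ann` instead of the toy tibble.

```r
# sketch: enrich German dictionary entries with translations and comments
# (the toy tibble stands in for grimm_ann; translations are illustrative)
dict <- tibble::tibble(
  token     = c("Abend", "Angst", "Antwort"),
  lemma     = c("Abend", "Angst", "Antwort"),
  upos      = c("NOUN", "NOUN", "NOUN"),
  frequency = c(3L, 1L, 1L)
)
# hand-curated lemma-to-English lookup table
translations <- tibble::tibble(
  lemma   = c("Abend", "Angst", "Antwort"),
  english = c("evening", "fear", "answer")
)
# join the translations and flag frequent items in a comment column
dict_enriched <- dict |>
  dplyr::left_join(translations, by = "lemma") |>
  dplyr::mutate(comment = dplyr::if_else(frequency > 1L,
                                         "frequent in sample",
                                         NA_character_))
dict_enriched
```

Because the join key is the lemma rather than the token, inflected forms that share a lemma automatically receive the same translation.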


Going Further: Crowd-Sourced Dictionaries

Section Overview

What you will learn: How the dictionary-generation approach described in this tutorial can be extended to collaborative, crowd-sourced dictionary projects using Git and GitHub.

The dictionary-generation workflow presented in this tutorial can be extended to crowd-sourced dictionary projects. By hosting the dictionary in a Git repository on GitHub or GitLab, you can allow any researcher with an account to contribute entries or corrections.

Contributors fork the repository, make their additions or corrections, and submit a pull request. The repository owner reviews each proposed change and decides whether to accept it — maintaining quality control while enabling distributed contribution. Because Git is a version control system, any erroneously accepted change can be reverted instantly.
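The fork-and-pull-request cycle can also be driven from within R via the usethis package. The sketch below is not run here, since it requires a GitHub account, network access, and a configured personal access token; the repository name is hypothetical.

```r
# sketch of the fork-and-pull-request workflow via usethis
# (requires a GitHub account and a configured GITHUB_PAT;
#  "ladal/german-dictionary" is a hypothetical repository name)

# 1. fork the dictionary repository and clone your fork locally
usethis::create_from_github("ladal/german-dictionary", fork = TRUE)

# 2. start a dedicated branch for your corrections
usethis::pr_init(branch = "fix-lemma-errors")

# ... edit the dictionary files and commit the changes ...

# 3. push the branch and open a pull request in the browser
usethis::pr_push()
```

The repository owner then reviews the pull request on GitHub and merges or rejects it, exactly as described above.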

This is particularly well-suited to the computational lexicography workflow presented here. The R-generated dictionary provides an accurate, automatically produced starting point; the crowd-sourcing layer adds human expert review, corrections, and extensions that automated methods cannot provide. RStudio’s built-in Git integration makes this pipeline accessible without command-line expertise — see Happy Git and GitHub for the useR for a practical guide.


Citation & Session Info

Citation

Martin Schweinberger. 2026. Lexicography with R. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/lex/lex.html (Version 2026.05.01).

@manual{martinschweinberger2026lexicography,
  author       = {Martin Schweinberger},
  title        = {Lexicography with R},
  year         = {2026},
  note         = {https://ladal.edu.au/tutorials/lex/lex.html},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
  edition      = {2026.05.01},
  doi          = {}
}
Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] plyr_1.8.9      flextable_0.9.7 cluster_2.1.6   coop_0.6-3     
[5] tidytext_0.4.2  udpipe_0.8.11   stringr_1.5.1   dplyr_1.1.4    

loaded via a namespace (and not attached):
 [1] janeaustenr_1.0.0       tidyr_1.3.1             utf8_1.2.4             
 [4] generics_0.1.3          fontLiberation_0.1.0    renv_1.1.1             
 [7] xml2_1.3.6              stringi_1.8.4           lattice_0.22-6         
[10] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.3         
[13] grid_4.4.2              fastmap_1.2.0           rprojroot_2.0.4        
[16] jsonlite_1.9.0          Matrix_1.7-2            zip_2.3.2              
[19] purrr_1.0.4             fontBitstreamVera_0.1.1 codetools_0.2-20       
[22] klippy_0.0.0.9500       textshaping_1.0.0       cli_3.6.4              
[25] rlang_1.1.5             fontquiver_0.2.1        tokenizers_0.3.0       
[28] withr_3.0.2             yaml_2.3.10             gdtools_0.4.1          
[31] tools_4.4.2             officer_0.6.7           uuid_1.2-1             
[34] here_1.0.1              assertthat_0.2.1        vctrs_0.6.5            
[37] R6_2.6.1                lifecycle_1.0.4         htmlwidgets_1.6.4      
[40] ragg_1.3.3              pkgconfig_2.0.3         pillar_1.10.1          
[43] data.table_1.17.0       glue_1.8.0              Rcpp_1.0.14            
[46] systemfonts_1.2.1       xfun_0.51               tibble_3.2.1           
[49] tidyselect_1.2.1        rstudioapi_0.17.1       knitr_1.49             
[52] htmltools_0.5.8.1       SnowballC_0.7.1         rmarkdown_2.29         
[55] compiler_4.4.2          askpass_1.2.1           openssl_2.3.2          
AI Transparency Statement

This tutorial was revised and expanded with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to fix two deprecated function calls (mutate_each() replaced with mutate(across(...)) and str_remove_all(., "[:punct:]") corrected to str_remove_all("[[:punct:]]")), rewrite . placeholder usage for compatibility with the native |> pipe (including removing the plyr::join(., ...) call by replacing it with a two-step left_join), move library(plyr) to the setup chunk, add Learning Objectives and Prerequisite callouts, replace <div class="warning"> and <div class="question"> HTML blocks with Quarto callouts, add section overview callouts, add six checkdown exercises, expand and clarify the prose explanations throughout, standardise chunk labels, fix the BibTeX comma bug, and align the document style with other LADAL tutorials. The YAML header and all content after the Citation heading were left unchanged. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for the tutorial’s accuracy and pedagogical appropriateness.


References

Agnes, Michael, Jonathan L Goldman, and Katherine Soltis. 2002. Webster’s New World Compact Desk Dictionary and Style Guide. Hungry Minds.
Amsler, Robert Alfred. 1981. The Structure of the Merriam-Webster Pocket Dictionary. Austin, TX: The University of Texas at Austin.
Bullinaria, J. A., and J. P. Levy. 2007. “Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study.” Behavior Research Methods 39: 510–26. https://doi.org/10.3758/bf03193020.
Firth, John R. 1957. “A Synopsis of Linguistic Theory, 1930–1955.” In Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company.
Rajeg, Gede Primahadi Wijaya, Karlina Denistia, and Simon Musgrave. 2019. “R Markdown Notebook for Vector Space Model and the Usage Patterns of Indonesian Denominal Verbs.” https://doi.org/10.6084/m9.figshare.9970205.v1.
Steiner, Roger J. 1985. “Dictionaries. The Art and Craft of Lexicography.” Dictionaries: Journal of the Dictionary Society of North America 7 (1): 294–300. https://doi.org/10.2307/3735704.